3  Relatedness in the era of Machine Learning

When two technologies (or products, activities, etc.) share similar sets of input requirements (knowledge, resources, etc.), we qualify them as related (Hausmann and Hidalgo 2011). This empirical observation has been confirmed in many areas across different streams of literature and formalised as the Principle of Relatedness (PoR) (Hidalgo et al. 2018). However, deriving concrete policy implications from this principle has proven far from straightforward (Hidalgo 2023; Li and Neffke 2024). The PoR is essentially a framework that formalises a qualitative intuition long present in various streams of literature: the industrial fabric of a geographic location matters (Hidalgo 2021; Hidalgo and Hausmann 2009). This framework enables researchers to derive metrics that quantify path dependency and, in turn, to infer more granular and pragmatic policy recommendations (Li and Neffke 2024). Yet many studies stop at the identification of path dependencies (Hidalgo 2023). Although such identification can be interesting in its own right, the purpose of the PoR is to help break free from the path dependency curse and focus on unique regional paths that promote diversification (Imbs and Wacziarg 2003). Diversification is the endpoint because it creates different sets of non-fungible tacit and non-tacit knowledge and capacity (Collins 1974) that can be compounded over time and across industries to create value that drives regional growth and development (Dosi 1982; Weitzman 1998). Knowledge, in all its forms, is the driver of the PoR policy implications (P.-A. Balland and Boschma 2022; P.-A. Balland et al. 2019). Although identifying promising areas of knowledge (related or unrelated) is a useful exercise, figuring out the unique elements that dictate the dynamics of knowledge flows is at the heart of industrial policy in this context (Nomaler and Verspagen 2024).
Moreover, the PoR is complemented by the Economic Complexity (EC) paradigm pioneered by Hidalgo and Hausmann (2009). EC is a methodological framework that builds on the PoR and frames economies as complex systems. The idea is simple: an economy, regardless of its scale, is a complex system whose components may be impossible to enumerate in full. But if we quantify the interactions between different systems and their components, we can estimate indices and metrics that capture most of the variation. In this sense, the PoR quantifies path dependency patterns (via metrics like relatedness and proximity), and EC quantifies the sophistication of specialisation patterns (via metrics like complexity and fitness)1. One can also describe relatedness as a variation of a recommendation system and complexity as a dimensionality reduction exercise. However, there is still no general consensus on the reliability of any given methodology for either exercise, regardless of how popular it is. What there is consensus on, however, is that these frameworks and the toolbox they provide can be improved further, as stated in C. Pinheiro (2025). Regardless, EC and the PoR have been adopted in many policy papers, such as Zaccaria et al. (2018), E and A (2021) and G, D, and L (2025).

In this context, the literature offers further threads of ideas aimed at a deeper understanding and analysis of complex economic systems. For instance, studies of unrelated diversification (Flávio L. Pinheiro et al. 2022; Boschma et al. 2023), geographic inequalities (Flavio L. Pinheiro et al. 2025; Hartmann et al. 2017), emerging industries and technologies (C. Lee et al. 2018; Fessina et al. 2024), and diversification strategies (Alshamsi, Pinheiro, and Hidalgo 2018), among others, are pioneering the effort to bridge gaps in theory, policy implications, and methodology. These ideas suggest that investing in unrelated activities can yield greater value and help break free from the path-dependency curse, a phenomenon that the literature shows exacerbates regional inequalities. Thus, one of the challenges is to quantify how to expand related activities beyond path dependency (strategy), which activity or sector to aim for (target), and the requirements for such an investment to be fruitful (condition). The scope of our study is to expand ideas around diversification strategies and conditions.

Policy makers often face a difficult choice between two options when deciding on industrial upgrading. The first is to take advantage of existing local capacity and knowledge (related diversification). This option is presumably the easier one, since the local economy already has what it takes for implementation (Boschma 2017). The second is to invest in building new capacity and knowledge (unrelated diversification), with all the risks that such a gamble entails (Coniglio et al. 2021). These choices have been at the centre of different theories in development economics, such as the big push and forward/backward linkages (Rosenstein-Rodan 1943; Hirschman 1958). However, we side with C. Pinheiro (2025) in arguing that the basis of this narrative is incomplete, since it contains an implicit assumption that is often overlooked: that it is easier for an economic system to diversify into a related activity than into an unrelated one. But how do we assess the ease of diversification while accounting for the level of relatedness?

In this paper, we propose a measure that quantifies the ease of diversification of an economic system and is agnostic to the level of relatedness of an activity. We believe that our approach accommodates this assumption explicitly and captures all possible diversification paths, be they related or unrelated. Additionally, we extend our contribution by assessing how different classical socio-economic factors influence the ease of diversification. The remainder of the paper is organised as follows: we first introduce the data, then present the methodology for each phase of our work, then present our results, and conclude with a discussion.

K. Lee and Malerba (2017) show that each stage of industrial maturity requires a different level of diversity, among other “initial conditions”. The main idea is that local capacities are a necessary but insufficient condition for related diversification. This has been formalised explicitly in Hausmann and Hidalgo (2011), where the authors define the “quiescence trap”, which can be observed when a country with few capabilities faces a low incentive to accumulate new ones. Additionally, in a broader policy context, even if we assume that these initial conditions are met locally and related diversification is feasible, it may simply exacerbate regional disadvantages. Indeed, related diversification has been observed to widen the gaps between locations (Mealy and Coyle 2022; Flavio L. Pinheiro et al. 2025). Consider two locations, one that already has a range of complex capacities and one that does not. If policy only backs what each location already does well, the complex location keeps getting more complex, drawing more investment and talent, while the other falls further behind, exacerbating inequality. Moreover, there is no consensus on a single way to quantify relatedness. The literature usually relies on the co-occurrence matrix to construct the relatedness between products, technologies, etc. Coniglio et al. (2021) point out that most studies in this context do not differentiate between random co-occurrences and co-occurrences that are due to related capacities, and propose a test to investigate the significance of these relationships. Similarly, the methodology proposed in Albora et al. (2023) responds to the same criticism and argues that the number of products or technologies almost always exceeds that of regions or locations, so that the information extracted from the co-occurrence matrix is at best a random walk.
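To make the co-occurrence construction concrete, the following is a minimal sketch of one common way it is built in the literature; it is not necessarily the construction used in this paper. It assumes a binary region-by-technology presence matrix `M` as input, and the normalisation by the larger of the two ubiquities follows the product-space convention. The function name `cooccurrence_proximity` and the choice to zero the diagonal are illustrative assumptions.

```python
import numpy as np


def cooccurrence_proximity(M):
    """Co-occurrence counts and a proximity matrix from a binary matrix.

    M has shape (n_regions, n_technologies); M[r, t] = 1 if region r is
    considered present/specialised in technology t (assumed given).
    """
    M = np.asarray(M, dtype=float)
    # C[i, j] = number of regions in which technologies i and j co-occur.
    C = M.T @ M
    # Ubiquity of technology i = number of regions where it is present.
    ubiquity = np.diag(C).copy()
    # Normalise by the larger ubiquity of the pair (assumed convention).
    max_ubiquity = np.maximum.outer(ubiquity, ubiquity)
    phi = np.zeros_like(C)
    np.divide(C, max_ubiquity, out=phi, where=max_ubiquity > 0)
    # Self-proximity is not informative, so it is set to zero here.
    np.fill_diagonal(phi, 0.0)
    return C, phi


# Toy example: 3 regions (rows) and 3 technologies (columns).
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])
C, phi = cooccurrence_proximity(M)
```

Because `C` is the product `M.T @ M`, its rank cannot exceed the number of regions, which is one way to see why having far more technologies than regions limits the information such a matrix can carry.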

We use data from the European Patent Office (EPO), which contain details on patent applications from 1978 to 2021. The EPO provides a rigorous and detailed classification of each patent application, up to 8 or more digits. In our case, we consider the IPC classifications, but since these classifications are extremely granular and the number of classes (8000+) is considerably larger than the number of regions observed, we limit our data to the 4-digit level of the IPC classification, that is, the first four characters of each code. These four characters contain three layers of information with which we can define a given technology: a section denoted by a letter, a class denoted by two digits, and a subclass denoted by another letter. Thus, an IPC class/technology such as F16H is structured hierarchically: Section F covers mechanical engineering (including lighting, heating, weapons, and blasting), Class 16 pertains to engineering elements and general methods for producing and transmitting mechanical power, and Subclass H specifically addresses gears, shaft connections, and gearing for conveying rotary motion. With this subset, we end up with 641 distinct technologies. The same data also provide details on where the applications were made; we capture these details at the NUTS2 level2 for 34 European countries within and outside the European Union, spanning 345 regions. We also use data from the Eurostat database to incorporate regional-level socio-economic factors, which are detailed further in section ?sec-factors and summarised in ?tbl-sum.
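The truncation to section, class, and subclass described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `IPCSubclass` and `to_subclass` are hypothetical, the input strings are example full IPC symbols rather than records from our data, and the pattern assumes codes begin with a section letter from A to H followed by two digits and a subclass letter.

```python
import re
from typing import NamedTuple


class IPCSubclass(NamedTuple):
    """The three layers carried by the first four characters of an IPC code."""
    section: str    # one letter, e.g. "F"
    ipc_class: str  # two digits, e.g. "16"
    subclass: str   # one letter, e.g. "H"

    @property
    def code(self) -> str:
        # Reassemble the 4-character technology code, e.g. "F16H".
        return f"{self.section}{self.ipc_class}{self.subclass}"


# Section letter, two-digit class, subclass letter at the start of the code.
_IPC_PREFIX = re.compile(r"^\s*([A-H])\s*(\d{2})\s*([A-Z])")


def to_subclass(full_code: str) -> IPCSubclass:
    """Truncate a full IPC symbol to its section/class/subclass layers."""
    match = _IPC_PREFIX.match(full_code.upper())
    if match is None:
        raise ValueError(f"Not a recognisable IPC code: {full_code!r}")
    return IPCSubclass(*match.groups())


# Illustrative inputs: several granular codes collapse onto one technology.
examples = ["F16H 57/04", "F16H 1/28", "A61K 31/00"]
technologies = sorted({to_subclass(code).code for code in examples})
```

Collapsing codes in this way is what reduces the many granular classes to a much smaller set of distinct technologies.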

Furthermore, we quantify regions’ specialisation by means of the Revealed Comparative Advantage (RCA) (Balassa 1965). In our context, the RCA measures a region’s relative specialisation in a given technology, which enables us to capture both expertise and diversity when we aggregate all the technologies for each region. This measure, also known as the Balassa index, has proved useful in uncovering complex and non-linear relationships between products and activities. Although the RCA was originally designed for international trade data, it has also been adopted in the literature on the geography of innovation. Simply put, the RCA simultaneously quantifies the relative level and the quality of co-occurrence, which reduces the noise in the data. Although some papers criticise the use of the RCA with patent classes (P. Balland and Boschma 2019; Diodato et al. 2023), we think it fits our objective of capturing meaningful relationships between technologies. We compute these RCA measures to obtain, for each year, a matrix with regions in its rows and technologies in its columns. We formalise it as follows. Let \(X_{r,t,y}\) be the measure of activity (patent counts) of region \(r\) in technology \(t\) during year \(y\), and let \(\mathcal{T}\), \(\mathcal{Y}\), \(\mathcal{R}\), and \(\mathcal{C}\) denote the sets of technologies, years, regions, and countries, respectively, such that:

\[ \mathcal{T} = \{\,t : 1 \le t \le N_T\},\quad \mathcal{Y} = \{\,y : 1 \le y \le N_Y\},\quad \mathcal{R} = \{\,r : 1 \le r \le N_R\},\quad \mathcal{C} = \{\,c : 1 \le c \le N_C\}. \]

Here \(N_T, N_Y, N_R, N_C\) are the total numbers of technologies, years, regions, and countries, respectively.

The RCA of region \(r\) in technology \(t\) in year \(y\) is

\[ \mathrm{RCA}_{r,t,y} = \frac{\displaystyle\frac{X_{r,t,y}}{\sum_{t'} X_{r,t',y}}} {\displaystyle\frac{\sum_{r'} X_{r',t,y}}{\sum_{r',t'} X_{r',t',y}}} = \frac{X_{r,t,y}\,\sum_{r',t'}X_{r',t',y}} {\bigl(\sum_{t'}X_{r,t',y}\bigr)\,\bigl(\sum_{r'}X_{r',t,y}\bigr)} \]

For each year \(y\), we then assemble the RCA matrix \(\mathbf{R}^{(y)}\) whose \((r,t)\)-entry is \(\mathrm{RCA}_{r,t,y}\):

\[ \mathbf{R}^{(y)} = \bigl[\mathrm{RCA}_{r,t,y}\bigr]_{r=1,\dots,N_{R}}^{t=1,\dots,N_{T}} \]

These yearly matrices are essential, since our entire approach relies on manipulations of the stacked matrices \(\mathbf{R}^{(y)}\).
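As a minimal sketch of the RCA formula above, the computation for one year reduces to row sums, column sums, and a grand total of the count matrix. The function names `rca_matrix` and `stacked_rca` and the toy counts are illustrative assumptions. Setting the RCA to zero where a region or a technology has no patents in a given year is also an assumption, since the formula is undefined in that case.

```python
import numpy as np


def rca_matrix(X):
    """RCA for one year from a (n_regions, n_technologies) count matrix X."""
    X = np.asarray(X, dtype=float)
    region_totals = X.sum(axis=1, keepdims=True)  # sum over t' of X[r, t']
    tech_totals = X.sum(axis=0, keepdims=True)    # sum over r' of X[r', t]
    grand_total = X.sum()                         # sum over r', t'
    denominator = region_totals * tech_totals
    rca = np.zeros_like(X)
    # Assumption: cells with an empty row or column are left at zero.
    np.divide(X * grand_total, denominator, out=rca, where=denominator > 0)
    return rca


def stacked_rca(counts):
    """Stack yearly RCA matrices from counts of shape (n_years, n_regions, n_technologies)."""
    return np.stack([rca_matrix(X_y) for X_y in counts])


# Toy data: 2 years, 2 regions, 2 technologies (not real patent counts).
counts = np.array([
    [[4, 0],
     [1, 5]],
    [[2, 2],
     [0, 0]],
])
R = stacked_rca(counts)  # R[y] plays the role of the yearly RCA matrix
```

Values above one indicate that a region holds a larger share of its patents in a technology than the reference set of regions does overall.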


  1. Following the nomenclature of the literature, we use the terms relatedness and complexity to refer to these two dimensions from here on.

  2. With the exception of Belgium and the United Kingdom, which were included at the NUTS1 level.